| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| ID | 45896 | 23102 | 13403 | 1 | 11475 | 34751 | 46332 |
| Model.Year | 45896 | 2004 | 12 | 1984 | 1992 | 2015 | 2023 |
| Estimated.Annual.Petrolum.Consumption..Barrels. | 45896 | 15 | 4.3 | 0.047 | 13 | 18 | 43 |
| Fuel.Type.1 | 45896 | ||||||
| ... Diesel | 1254 | 3% | |||||
| ... Electricity | 484 | 1% | |||||
| ... Midgrade Gasoline | 155 | 0% | |||||
| ... Natural Gas | 60 | 0% | |||||
| ... Premium Gasoline | 14138 | 31% | |||||
| ... Regular Gasoline | 29805 | 65% | |||||
| City.MPG..Fuel.Type.1. | 45896 | 19 | 10 | 6 | 15 | 21 | 150 |
| Highway.MPG..Fuel.Type.1. | 45896 | 25 | 9.4 | 9 | 20 | 28 | 140 |
| Combined.MPG..Fuel.Type.1. | 45896 | 21 | 9.8 | 7 | 17 | 23 | 142 |
| Fuel.Type.2 | 45896 | ||||||
| ... | 44059 | 96% | |||||
| ... E85 | 1513 | 3% | |||||
| ... Electricity | 296 | 1% | |||||
| ... Natural Gas | 20 | 0% | |||||
| ... Propane | 8 | 0% | |||||
| City.MPG..Fuel.Type.2. | 45896 | 0.85 | 6.5 | 0 | 0 | 0 | 145 |
| Highway.MPG..Fuel.Type.2. | 45896 | 1 | 6.6 | 0 | 0 | 0 | 121 |
| Combined.MPG..Fuel.Type.2. | 45896 | 0.9 | 6.4 | 0 | 0 | 0 | 133 |
| Engine.Cylinders | 45409 | 5.7 | 1.8 | 2 | 4 | 6 | 16 |
| Engine.Displacement | 45411 | 3.3 | 1.4 | 0 | 2.2 | 4.2 | 8.4 |
| Time.to.Charge.EV..hours.at.120v. | 45896 | 0 | 0 | 0 | 0 | 0 | 0 |
| Time.to.Charge.EV..hours.at.240v. | 45896 | 0.11 | 1 | 0 | 0 | 0 | 15 |
| Range..for.EV. | 45896 | 2.4 | 25 | 0 | 0 | 0 | 520 |
| City.Range..for.EV...Fuel.Type.1. | 45896 | 1.6 | 21 | 0 | 0 | 0 | 521 |
| City.Range..for.EV...Fuel.Type.2. | 45896 | 0.17 | 2.7 | 0 | 0 | 0 | 135 |
| Hwy.Range..for.EV...Fuel.Type.1. | 45896 | 1.5 | 20 | 0 | 0 | 0 | 520 |
| Hwy.Range..for.EV...Fuel.Type.2. | 45896 | 0.16 | 2.5 | 0 | 0 | 0 | 115 |
Annex
Data Columns Detailed
| Name | data |
|---|---|
| Number of rows | 45896 |
| Number of columns | 26 |
| Column type frequency: | |
| character | 8 |
| numeric | 18 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1 | 3 | 34 | 0 | 141 | 0 |
| Model | 0 | 1 | 1 | 47 | 0 | 4762 | 0 |
| Fuel.Type.1 | 0 | 1 | 6 | 17 | 0 | 6 | 0 |
| Fuel.Type.2 | 0 | 1 | 0 | 11 | 44059 | 5 | 0 |
| Drive | 0 | 1 | 0 | 26 | 1186 | 8 | 0 |
| Engine.Description | 0 | 1 | 0 | 46 | 17031 | 590 | 0 |
| Transmission | 0 | 1 | 0 | 32 | 11 | 41 | 0 |
| Vehicle.Class | 0 | 1 | 4 | 34 | 0 | 34 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 23102.11 | 13403.10 | 1.00 | 11474.75 | 23090.50 | 34751.25 | 46332.00 | ▇▇▇▇▇ |
| Model.Year | 0 | 1.00 | 2003.61 | 12.19 | 1984.00 | 1992.00 | 2005.00 | 2015.00 | 2023.00 | ▇▆▆▇▇ |
| Estimated.Annual.Petrolum.Consumption..Barrels. | 0 | 1.00 | 15.33 | 4.34 | 0.05 | 12.94 | 14.88 | 17.50 | 42.50 | ▁▇▃▁▁ |
| City.MPG..Fuel.Type.1. | 0 | 1.00 | 19.11 | 10.31 | 6.00 | 15.00 | 17.00 | 21.00 | 150.00 | ▇▁▁▁▁ |
| Highway.MPG..Fuel.Type.1. | 0 | 1.00 | 25.16 | 9.40 | 9.00 | 20.00 | 24.00 | 28.00 | 140.00 | ▇▁▁▁▁ |
| Combined.MPG..Fuel.Type.1. | 0 | 1.00 | 21.33 | 9.78 | 7.00 | 17.00 | 20.00 | 23.00 | 142.00 | ▇▁▁▁▁ |
| City.MPG..Fuel.Type.2. | 0 | 1.00 | 0.85 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 145.00 | ▇▁▁▁▁ |
| Highway.MPG..Fuel.Type.2. | 0 | 1.00 | 1.00 | 6.55 | 0.00 | 0.00 | 0.00 | 0.00 | 121.00 | ▇▁▁▁▁ |
| Combined.MPG..Fuel.Type.2. | 0 | 1.00 | 0.90 | 6.43 | 0.00 | 0.00 | 0.00 | 0.00 | 133.00 | ▇▁▁▁▁ |
| Engine.Cylinders | 487 | 0.99 | 5.71 | 1.77 | 2.00 | 4.00 | 6.00 | 6.00 | 16.00 | ▇▇▅▁▁ |
| Engine.Displacement | 485 | 0.99 | 3.28 | 1.36 | 0.00 | 2.20 | 3.00 | 4.20 | 8.40 | ▁▇▅▂▁ |
| Time.to.Charge.EV..hours.at.120v. | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| Time.to.Charge.EV..hours.at.240v. | 0 | 1.00 | 0.11 | 1.01 | 0.00 | 0.00 | 0.00 | 0.00 | 15.30 | ▇▁▁▁▁ |
| Range..for.EV. | 0 | 1.00 | 2.36 | 24.97 | 0.00 | 0.00 | 0.00 | 0.00 | 520.00 | ▇▁▁▁▁ |
| City.Range..for.EV...Fuel.Type.1. | 0 | 1.00 | 1.62 | 20.89 | 0.00 | 0.00 | 0.00 | 0.00 | 520.80 | ▇▁▁▁▁ |
| City.Range..for.EV...Fuel.Type.2. | 0 | 1.00 | 0.17 | 2.73 | 0.00 | 0.00 | 0.00 | 0.00 | 135.28 | ▇▁▁▁▁ |
| Hwy.Range..for.EV...Fuel.Type.1. | 0 | 1.00 | 1.51 | 19.70 | 0.00 | 0.00 | 0.00 | 0.00 | 520.50 | ▇▁▁▁▁ |
| Hwy.Range..for.EV...Fuel.Type.2. | 0 | 1.00 | 0.16 | 2.46 | 0.00 | 0.00 | 0.00 | 0.00 | 114.76 | ▇▁▁▁▁ |
Data summary
The table below provides an overview of the dataset.
Cleaned data overview
Cleaned Dataset
| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data_cleaned | 42240 | 18 | 8 | 5 | None |
Cleaned and Reduced Dataset
| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data_cleaned_reduced | 42061 | 18 | 8 | 5 | None |
Eigenvalues for the Principal Components Analysis
Call:
PCA(X = data_prepared, graph = FALSE)
Eigenvalues
| Dimension | Variance | % of var. | Cumulative % of var. |
|---|---|---|---|
| Dim.1 | 4.616 | 25.644 | 25.644 |
| Dim.2 | 3.735 | 20.748 | 46.392 |
| Dim.3 | 2.126 | 11.809 | 58.201 |
| Dim.4 | 1.292 | 7.177 | 65.378 |
| Dim.5 | 1.019 | 5.660 | 71.038 |
| Dim.6 | 0.988 | 5.490 | 76.527 |
| Dim.7 | 0.856 | 4.753 | 81.280 |
| Dim.8 | 0.827 | 4.594 | 85.875 |
| Dim.9 | 0.765 | 4.250 | 90.124 |
| Dim.10 | 0.549 | 3.047 | 93.172 |
| Dim.11 | 0.510 | 2.834 | 96.005 |
| Dim.12 | 0.349 | 1.941 | 97.946 |
| Dim.13 | 0.193 | 1.071 | 99.017 |
| Dim.14 | 0.137 | 0.759 | 99.777 |
| Dim.15 | 0.027 | 0.150 | 99.926 |
| Dim.16 | 0.008 | 0.045 | 99.971 |
| Dim.17 | 0.003 | 0.018 | 99.989 |
| Dim.18 | 0.002 | 0.011 | 100.000 |
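The percentage columns follow directly from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues, which for a PCA on standardized variables equals the number of variables (here 18, up to rounding). The sketch below recomputes them from the rounded values reported in this annex. The variable names are illustrative, and the Kaiser rule (eigenvalue greater than 1) is shown only as one common heuristic, not necessarily the criterion used in this analysis.

```r
# Eigenvalues copied from the PCA output (rounded to 3 decimals).
eigenvalues <- c(4.616, 3.735, 2.126, 1.292, 1.019, 0.988, 0.856, 0.827, 0.765,
                 0.549, 0.510, 0.349, 0.193, 0.137, 0.027, 0.008, 0.003, 0.002)

# Share of total variance carried by each dimension, and its running total.
pct_var <- 100 * eigenvalues / sum(eigenvalues)
cum_pct_var <- cumsum(pct_var)

# Kaiser heuristic (an assumption, not the report's stated rule):
# count the dimensions whose eigenvalue exceeds 1.
n_kaiser <- sum(eigenvalues > 1)

print(round(head(data.frame(eigenvalues, pct_var, cum_pct_var), 5), 2))
print(n_kaiser)
```

Because the inputs are rounded, the recomputed percentages agree with the reported ones only to about two decimal places.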
Individuals (the 10 first)
| Individual | Dist | Dim.1 | Dim.1 ctr | Dim.1 cos2 | Dim.2 | Dim.2 ctr | Dim.2 cos2 | Dim.3 | Dim.3 ctr | Dim.3 cos2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.334 | -1.087 | 0.001 | 0.106 | 0.065 | 0.000 | 0.000 | 0.018 | 0.000 | 0.000 |
| 2 | 3.387 | -0.907 | 0.000 | 0.072 | -0.079 | 0.000 | 0.001 | 2.589 | 0.007 | 0.584 |
| 3 | 3.522 | -1.153 | 0.001 | 0.107 | -0.031 | 0.000 | 0.000 | 2.603 | 0.008 | 0.546 |
| 4 | 2.761 | -1.068 | 0.001 | 0.150 | 0.012 | 0.000 | 0.000 | 0.658 | 0.000 | 0.057 |
| 5 | 2.713 | -0.991 | 0.001 | 0.133 | -0.022 | 0.000 | 0.000 | 0.609 | 0.000 | 0.050 |
| 6 | 2.713 | -0.991 | 0.001 | 0.133 | -0.022 | 0.000 | 0.000 | 0.609 | 0.000 | 0.050 |
| 7 | 2.847 | -1.039 | 0.001 | 0.133 | -0.066 | 0.000 | 0.001 | 0.490 | 0.000 | 0.030 |
| 8 | 2.876 | -1.151 | 0.001 | 0.160 | -0.019 | 0.000 | 0.000 | 0.546 | 0.000 | 0.036 |
| 9 | 4.850 | -1.810 | 0.002 | 0.139 | 0.346 | 0.000 | 0.005 | -0.005 | 0.000 | 0.000 |
| 10 | 3.313 | -1.686 | 0.001 | 0.259 | 0.231 | 0.000 | 0.005 | 1.551 | 0.003 | 0.219 |
Variables (the 10 first)
| Variable | Dim.1 | Dim.1 ctr | Dim.1 cos2 | Dim.2 | Dim.2 ctr | Dim.2 cos2 | Dim.3 | Dim.3 ctr | Dim.3 cos2 |
|---|---|---|---|---|---|---|---|---|---|
| make | 0.106 | 0.242 | 0.011 | -0.092 | 0.229 | 0.009 | -0.472 | 10.475 | 0.223 |
| model_year | 0.347 | 2.605 | 0.120 | 0.032 | 0.028 | 0.001 | -0.120 | 0.679 | 0.014 |
| vehicle_class | -0.141 | 0.428 | 0.020 | 0.051 | 0.071 | 0.003 | 0.474 | 10.568 | 0.225 |
| drive | -0.038 | 0.031 | 0.001 | 0.003 | 0.000 | 0.000 | 0.158 | 1.172 | 0.025 |
| engine_cylinders | 0.077 | 0.130 | 0.006 | -0.073 | 0.141 | 0.005 | 0.755 | 26.797 | 0.570 |
| engine_displacement | 0.125 | 0.338 | 0.016 | -0.104 | 0.292 | 0.011 | 0.851 | 34.040 | 0.724 |
| transmission | -0.498 | 5.376 | 0.248 | 0.069 | 0.128 | 0.005 | -0.018 | 0.016 | 0.000 |
| fuel_type_1 | -0.419 | 3.802 | 0.175 | 0.223 | 1.328 | 0.050 | -0.170 | 1.358 | 0.029 |
| city_mpg_fuel_type_1 | 0.837 | 15.173 | 0.700 | -0.308 | 2.532 | 0.095 | -0.242 | 2.753 | 0.059 |
| highway_mpg_fuel_type_1 | 0.800 | 13.860 | 0.640 | -0.313 | 2.630 | 0.098 | -0.351 | 5.790 | 0.123 |
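For a PCA on standardized variables, a variable's cos2 on a dimension is its squared coordinate, and its contribution (ctr, in percent) is that cos2 divided by the dimension's eigenvalue. The sketch below checks this relation against three rows of the variables output, assuming the PCA was run with standardization (the `PCA()` call in this annex does not override the scaling option). The data frame and column names are illustrative, and agreement is approximate because the printed values are rounded.

```r
# Coordinates copied from the variables output; `dimension` marks the axis used.
vars <- data.frame(
  variable  = c("city_mpg_fuel_type_1", "highway_mpg_fuel_type_1",
                "engine_displacement"),
  dimension = c(1, 1, 3),
  coord     = c(0.837, 0.800, 0.851)
)

# Eigenvalues of Dim.1 to Dim.3 from the eigenvalue output.
eigenvalues <- c(4.616, 3.735, 2.126)

# cos2 is the squared coordinate; ctr rescales it by the axis eigenvalue.
vars$cos2 <- vars$coord^2
vars$ctr  <- 100 * vars$cos2 / eigenvalues[vars$dimension]

print(vars)
```

The recomputed values can be compared with the reported cos2 (0.700, 0.640, 0.724) and ctr (15.173, 13.860, 34.040) for the same rows.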
3D Biplot for 6 clusters
Warning in PCA(data_prepared, graph = FALSE): Missing values are imputed by the
mean of the variable: you should use the imputePCA function of the missMDA
package

Based on the silhouette plot in the unsupervised learning part, we provide a 3D biplot for 6 clusters, since the elbow plot also suggests that 6 clusters is a reasonable candidate. The biplot shows that the data can indeed be divided into 6 clusters. Comparing it with the 3D biplot in the ‘results_unsupervised_learning’ part, we see that cluster 2 could be split into four smaller clusters, which indicates heterogeneity within this cluster when only 3 clusters are used. With 6 clusters, however, these 4 distinct clusters are harder to interpret.

This split also explains the second elbow in the elbow plot: the first elbow occurs at 3 clusters, but the curve drops steeply again between 5 and 6 clusters, meaning that 4 or 5 clusters would add little, whereas a 6th cluster could capture additional structure. The 3-cluster solution nevertheless remains meaningful and makes our clustering analysis more interpretable than the 6-cluster one, which is why we retained only 3 clusters for our analysis.
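The elbow reasoning can be illustrated with a minimal sketch: compute the total within-cluster sum of squares (WSS) for a range of k and look at how much each additional cluster reduces it. The example uses synthetic data with three well-separated groups, not the vehicle dataset, and plain `kmeans()` from base R, which is an assumption since this annex does not state which clustering routine was used. All object names are illustrative.

```r
# Elbow curve sketch on synthetic data (NOT the vehicle dataset).
set.seed(42)

# Three well-separated Gaussian blobs in 3 dimensions, standing in for PCA scores.
centers <- matrix(c(0, 0, 0,  6, 6, 0,  0, 6, 6), ncol = 3, byrow = TRUE)
scores <- do.call(rbind, lapply(1:3, function(i) {
  sweep(matrix(rnorm(100 * 3), ncol = 3), 2, centers[i, ], "+")
}))

# Total within-cluster sum of squares for k = 1..8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WSS obtained by adding one more cluster;
# an elbow is where this reduction shrinks sharply.
gain <- -diff(wss)
names(gain) <- paste0(k_values[-length(k_values)], "->", k_values[-1])
print(round(gain, 1))
```

In this synthetic case the reduction collapses after k = 3, giving a single elbow. A second noticeable reduction at a larger k, like the one described between 5 and 6 clusters, would show up as a later entry of `gain` that is clearly larger than its neighbours.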